Olist was founded in 2015 and was a relatively new business when it released its dataset on Kaggle. We have observations from Sept 2016 to Aug 2018. For a young business it is critical to understand how it is growing and how it can prepare to scale and sustain that growth.
With this in mind, we will forecast sales for Olist so that the business can allocate resources for stock preparation and marketing campaigns and be better prepared for the future.
1. Importing Libraries </br> 2. Importing Data</br> $\;\;\;\;$ 2.1 Data dictionary </br> $\;\;\;\;$ 2.2 Processing data for time series </br> 3. Exploratory Data Analysis</br> $\;\;\;\;$ 3.1 General exploration </br> $\;\;\;\;$ 3.2 Decomposing time series </br> $\;\;\;\;$ 3.3 Checking stationarity </br> 4. Preparation for Modeling</br> $\;\;\;\;$ 4.1 Train and test split </br> $\;\;\;\;$ 4.2 Defining functions for plotting predictions and forecast </br> $\;\;\;\;$ 4.3 Defining functions for evaluation </br> 5. Modelling (SARIMA) </br> $\;\;\;\;$ 5.1 Plotting ACF and PACF plot </br> $\;\;\;\;$ 5.2 Applying SARIMA model </br> $\;\;\;\;$ 5.3 Plotting predictions and evaluating SARIMA model </br> $\;\;\;\;$ 5.4 Adding Exogenous variable holiday for SARIMAX </br> $\;\;\;\;$ 5.5 Applying grid search on SARIMAX with exogenous variable </br> $\;\;\;\;$ 5.6 Plotting predictions and evaluating SARIMAX model </br> 6. Modelling (FB Prophet) </br> $\;\;\;\;$ 6.1 Preparing data for FB Prophet </br> $\;\;\;\;$ 6.2 Applying a Baseline FB Prophet </br> $\;\;\;\;$ 6.3 Plotting and Evaluating Baseline model </br> $\;\;\;\;$ 6.4 Tuning FB Prophet using Grid Search </br> $\;\;\;\;$ 6.5 Plotting and evaluating Tuned FB Prophet </br> 7. Modelling (XG Boost Regression) </br> $\;\;\;\;$ 7.1 Preparing data for XG Boost Regression </br> $\;\;\;\;$ 7.2 Baseline model on daily data and its performance </br> $\;\;\;\;$ 7.3 Tuning XG Boost and evaluating and its performance </br> 8. Modelling (LSTM) </br> $\;\;\;\;$ 8.1 Preparing data for LSTM </br> $\;\;\;\;$ 8.2 Baseline model on daily data and its performance </br> 9. Discussing issues encountered with hourly sampled data </br> 10. Conclusion
#importing basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
#some built in functions
import itertools
from datetime import datetime, timedelta
import warnings
#to suppress warnings generated for fb prophet using .append method by default
# warnings.simplefilter(action='ignore', category=FutureWarning)
#to suppress warnings in Sarima model
# warnings.simplefilter(action='ignore', category=UserWarning)
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
from statsmodels.tools.sm_exceptions import InterpolationWarning
warnings.simplefilter('ignore', InterpolationWarning)
#importing high level interactive plotting libraries
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
# importing time series stats model
import statsmodels.api as sm
from statsmodels.api import tsa
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
#importing Sarimax
from statsmodels.tsa.statespace.sarimax import SARIMAX
#sklearn library
from sklearn.preprocessing import MinMaxScaler
#importing fb prophet
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.plot import plot_plotly
#importing XG Boost
import xgboost as xgb
from xgboost import plot_importance, plot_tree
#importing LSTM libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
#loading the dataset
master=pd.read_csv('data_cleaned/master_dataset.csv')
#reading the head
master.head()
| order_id | customer_id | order_purchase_timestamp | order_estimated_delivery_date | qty | product_id | seller_id | shipping_limit_date | price | freight_value | ... | seller_state | seller_lat | seller_lng | customer_unique_id | customer_city | customer_state | customer_lat | customer_lng | review_score | total_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | 2017-10-02 10:56:33 | 2017-10-18 | 1 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 2017-10-06 11:07:15 | 29.99 | 8.72 | ... | SP | -23.680862 | -46.444311 | 7c396fd4830fd04220f754e42b4e5bff | sao paulo | SP | -23.577482 | -46.587077 | 4 | 29.99 |
| 1 | 128e10d95713541c87cd1a2e48201934 | a20e8105f23924cd00833fd87daa0831 | 2017-08-15 18:29:31 | 2017-08-28 | 1 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 2017-08-21 20:05:16 | 29.99 | 7.78 | ... | SP | -23.680862 | -46.444311 | 3a51803cc0d012c3b5dc8b7528cb05f7 | sao paulo | SP | -23.564636 | -46.534401 | 4 | 29.99 |
| 2 | 0e7e841ddf8f8f2de2bad69267ecfbcf | 26c7ac168e1433912a51b924fbd34d34 | 2017-08-02 18:24:47 | 2017-08-15 | 1 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 2017-08-08 18:37:31 | 29.99 | 7.78 | ... | SP | -23.680862 | -46.444311 | ef0996a1a279c26e7ecbd737be23d235 | sao paulo | SP | -23.600462 | -46.655318 | 5 | 29.99 |
| 3 | bfc39df4f36c3693ff3b63fcbea9e90a | 53904ddbea91e1e92b2b3f1d09a7af86 | 2017-10-23 23:26:46 | 2017-11-13 | 1 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | 2017-10-31 02:14:11 | 29.99 | 14.10 | ... | SP | -23.680862 | -46.444311 | e781fdcc107d13d865fc7698711cc572 | florianopolis | SC | -27.546553 | -48.497691 | 3 | 29.99 |
| 4 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | 2018-07-24 20:41:37 | 2018-08-13 | 1 | 595fac2a385ac33a80bd5114aec74eb8 | 289cdb325fb7e7f891c38608bf9e0962 | 2018-07-30 03:24:27 | 118.70 | 22.76 | ... | SP | -19.807885 | -43.980818 | af07308b275d755c9edb36a90c618231 | barreiras | BA | -12.186877 | -44.540232 | 4 | 118.70 |
5 rows × 29 columns
#understanding the datatypes
master.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 110013 entries, 0 to 110012 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 110013 non-null object 1 customer_id 110013 non-null object 2 order_purchase_timestamp 110013 non-null object 3 order_estimated_delivery_date 110013 non-null object 4 qty 110013 non-null int64 5 product_id 110013 non-null object 6 seller_id 110013 non-null object 7 shipping_limit_date 110013 non-null object 8 price 110013 non-null float64 9 freight_value 110013 non-null float64 10 product_name_lenght 110013 non-null float64 11 product_description_lenght 110013 non-null float64 12 product_photos_qty 110013 non-null float64 13 product_weight_g 110013 non-null float64 14 product_length_cm 110013 non-null float64 15 product_height_cm 110013 non-null float64 16 product_width_cm 110013 non-null float64 17 product_category_name_english 110013 non-null object 18 seller_city 110013 non-null object 19 seller_state 110013 non-null object 20 seller_lat 110013 non-null float64 21 seller_lng 110013 non-null float64 22 customer_unique_id 110013 non-null object 23 customer_city 110013 non-null object 24 customer_state 110013 non-null object 25 customer_lat 110013 non-null float64 26 customer_lng 110013 non-null float64 27 review_score 110013 non-null int64 28 total_amount 110013 non-null float64 dtypes: float64(14), int64(2), object(13) memory usage: 24.3+ MB
#checking if there is any null value
master.isnull().sum()
order_id 0 customer_id 0 order_purchase_timestamp 0 order_estimated_delivery_date 0 qty 0 product_id 0 seller_id 0 shipping_limit_date 0 price 0 freight_value 0 product_name_lenght 0 product_description_lenght 0 product_photos_qty 0 product_weight_g 0 product_length_cm 0 product_height_cm 0 product_width_cm 0 product_category_name_english 0 seller_city 0 seller_state 0 seller_lat 0 seller_lng 0 customer_unique_id 0 customer_city 0 customer_state 0 customer_lat 0 customer_lng 0 review_score 0 total_amount 0 dtype: int64
print(f" The number of unique orders : {master['order_id'].nunique()}")
print(f" The number of unique customers : {master['customer_unique_id'].nunique()}")
print(f" The total number of customer transactions specified by customer ids : {master['customer_id'].nunique()}")
print(f" Overall {round((master['customer_unique_id'].nunique()/master['customer_id'].nunique())*100, 2)} % transactions \
are made by new customers and {round((1 -(master['customer_unique_id'].nunique()/master['customer_id'].nunique()))*100,2)} % \
are made by repeat customers.")
print(f" The number of unique sellers {master['seller_id'].nunique()}")
print(f" The number of unique products sold at Olist platform {master['product_id'].nunique()}")
The number of unique orders : 95832 The number of unique customers : 92755 The total number of customer transactions specified by customer ids : 95832 Overall 96.79 % transactions are made by new customers and 3.21 % are made by repeat customers. The number of unique sellers 2965 The number of unique products sold at Olist platform 32072
We have a total of 110013 rows (order line items) with 28 features that describe each order made on the Olist platform: the product category bought, quantity purchased, unit price, purchase timestamp, delivery details, review score, and customer and seller information.
Target variable: total_amount. We calculated this value by multiplying qty and price. This is the actual sales amount that matters to the business, and it is what we will predict to help the business prepare for the future.
Note: We have not included freight charges in the calculation of 'total_amount' because when Olist started its business it outsourced logistics to a third party, and we therefore want to give the business insight into only the sales from products sold on the Olist platform.
We also found that Olist acquired PAX, its logistics partner, later in 2020; check here for more details.
We have seen that 'order_purchase_timestamp' is stored as an object (string). We will start by converting this column to datetime format and then extract some date features for analysis.
#converting date columns which are in object format to datetime format
master['order_purchase_timestamp']=pd.to_datetime(master['order_purchase_timestamp'])
We can extract year, month, week, weekday and day-of-month information from the dates.
#extracting date features from the datetime column
master['purchase_year']=master['order_purchase_timestamp'].dt.year
master['purchase_month']=master['order_purchase_timestamp'].dt.month
master['purchase_MMYYYY']=master['order_purchase_timestamp'].dt.strftime('%m-%Y')
master['purchase_week']=master['order_purchase_timestamp'].dt.isocalendar().week
master['purchase_dayofweek']=master['order_purchase_timestamp'].dt.weekday
master['purchase_dayofmonth']=master['order_purchase_timestamp'].dt.day
We will aggregate the total_amount by dates so that we can get a time series, meaning a dataframe with the total_amount column arranged in order as per dates. We will set the dates as index.
# Creating new DataFrame with daily frequency and number of orders
df_agg = master.groupby(pd.Grouper(key='order_purchase_timestamp', freq='D'))['total_amount'].sum().reset_index()
df_agg.set_index('order_purchase_timestamp', inplace=True)
df_agg.index.freq = 'D' # To keep pandas inference in check!
#reading top five rows
print(df_agg.head())
# checking the mean, max and count values.
print(df_agg.describe())
total_amount
order_purchase_timestamp
2016-09-15 269.94
2016-09-16 0.00
2016-09-17 0.00
2016-09-18 0.00
2016-09-19 0.00
total_amount
count 714.000000
mean 20881.443333
std 16038.400478
min 0.000000
25% 9275.570000
50% 19772.815000
75% 30819.712500
max 184834.170000
#checking the info
df_agg.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 714 entries, 2016-09-15 to 2018-08-29 Freq: D Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 total_amount 714 non-null float64 dtypes: float64(1) memory usage: 11.2 KB
#checking head
df_agg.head()
| total_amount | |
|---|---|
| order_purchase_timestamp | |
| 2016-09-15 | 269.94 |
| 2016-09-16 | 0.00 |
| 2016-09-17 | 0.00 |
| 2016-09-18 | 0.00 |
| 2016-09-19 | 0.00 |
#index start
df_agg.head(1).index
DatetimeIndex(['2016-09-15'], dtype='datetime64[ns]', name='order_purchase_timestamp', freq='D')
#index end
df_agg.tail(1).index
DatetimeIndex(['2018-08-29'], dtype='datetime64[ns]', name='order_purchase_timestamp', freq='D')
We have a total of 714 observations, starting from '2016-09-15' through '2018-08-29'.
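As a quick sanity check (a hypothetical cell, not part of the original analysis), the daily index produced by the Grouper should contain every calendar day between those endpoints, zero-sales days included:

```python
import pandas as pd

# pd.date_range includes both endpoints at daily frequency,
# so its length should match the 714 rows reported above.
full_range = pd.date_range(start='2016-09-15', end='2018-08-29', freq='D')
print(len(full_range))  # 714
```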
Now that we have our original master data and time series data, we will try to explore some high level features of our master data and will go a level deeper for our time series.
We can plot a heat map to see which numerical features are highly correlated with the total_amount. This is just a high-level overview to see which features can impact sales and how the features correlate with each other.
#canvas size
plt.figure(figsize=(18,12))
#correlation between all columns
corr_df= master.select_dtypes(include=np.number).corr()
# creating mask
mask = np.triu(np.ones_like(corr_df, dtype=bool))
# plotting a triangle correlation heatmap
sns.heatmap(corr_df, cmap="RdPu", annot=True, mask=mask)
<AxesSubplot: >
fig=px.histogram(df_agg, x='total_amount')
fig.update_layout(
yaxis_title="Frequency",
xaxis_title="Revenue in Brazilian Real",
legend_title="",
title="Daily Revenue distribution from Sept 2016 to Aug 2018")
fig.show()
#creating a pivot table with three index levels (year, month and the MM-YYYY string), aggregating the sum of total_amount and the number of unique orders
sales_df=master.pivot_table(values = ['order_id', 'total_amount']
, index=['purchase_year','purchase_month','purchase_MMYYYY']
, aggfunc={'order_id':'nunique', 'total_amount':'sum'})
from plotly.subplots import make_subplots
#to plot number of orders by MM-YYYY
trace1 = go.Bar(
x=sales_df.index.get_level_values(2),
y=sales_df['order_id'],
name='Orders',
marker=dict(
color='rgb(34,163,192)'
)
)
#to plot sum of total_amount by MM-YYYY
trace2 = go.Scatter(
x=sales_df.index.get_level_values(2),
y=sales_df['total_amount'],
name='Revenue',
yaxis='y2' #using a right side y-axis
)
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 1000, title = 'Revenue and Orders by Month-Year',xaxis=dict(
tickangle=-90, title='Purchase Month'), yaxis=dict(title= 'Number of Orders'), yaxis2=dict(title= 'Revenue amount'))
fig.show()
We can see that both the number of orders and total_amount are growing. Some peaks reappear after a few months. The highest peak was recorded in Nov 2017.
print(f"The overall revenue earned as of Aug 2018 is {df_agg['total_amount'].sum()} Brazilian Real.")
The overall revenue earned as of Aug 2018 is 14909350.540000001 Brazilian Real.
df_agg.groupby(df_agg.index.year).sum()
| total_amount | |
|---|---|
| order_purchase_timestamp | |
| 2016 | 45756.84 |
| 2017 | 6722987.98 |
| 2018 | 8140605.72 |
#growth from 2016 to 2017 as a share of 2017 revenue
(6722987.98-45756.84)/6722987.98
0.9931939726597577
#growth from 2017 to 2018 as a share of 2018 revenue
(8140605.72-6722987.98)/8140605.72
0.17414155515690538
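The two cells above divide by the later year's revenue; conventional year-over-year growth uses the prior year as the base. A small sketch with `pct_change` (a hypothetical cell; note that 2016 and 2018 are partial years, so the percentages are indicative only):

```python
import pandas as pd

# Yearly revenue totals from the table above (Brazilian Real)
yearly = pd.Series({2016: 45756.84, 2017: 6722987.98, 2018: 8140605.72})

# pct_change uses the prior period as the base: (current - previous) / previous
yoy_growth = yearly.pct_change() * 100
print(yoy_growth.round(2))
```

By this convention, 2017-to-2018 growth is about 21%, versus the roughly 17% obtained above where 2018 revenue is the denominator.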
#plotting daily data to get high level picture
fig = px.line(df_agg, x=df_agg.index, y='total_amount')
# axis labels and title
fig.update_layout(
yaxis_title="Total Revenue earned (Brazilian Real)",
legend_title="",
title="Daily Revenue from Sept 2016 to Aug 2018"
)
# activate slider
fig.update_xaxes(rangeslider_visible=True)
#annotate peak
fig.add_annotation(x='2017-11-24', y= 184834.17, text=f'Black Friday Sale', yanchor='bottom',
showarrow=True, arrowhead=1, arrowsize=1, arrowwidth=2, arrowcolor="#636363",
ax=-20, ay=-30, font=dict(size=15, color="green", family="Courier New, monospace"),
align="left", bordercolor="green", borderwidth=2, bgcolor="#CFECEC", opacity=0.8)
fig.show()
We remove the data before Jan 01 2017 because there are many consecutive days with zero sales. Including them in our model may hurt our forecasting.
This may be because the period from Sept 2016 to Dec 2016 was an experimental phase. We have continuous sales after Jan 2017.
#removing the rows before Jan 01 2017
daily_data=df_agg.loc[df_agg.index>='2017-01-01', :]
daily_data.tail()
| total_amount | |
|---|---|
| order_purchase_timestamp | |
| 2018-08-25 | 10891.40 |
| 2018-08-26 | 8526.19 |
| 2018-08-27 | 5542.90 |
| 2018-08-28 | 4088.37 |
| 2018-08-29 | 2670.54 |
print(f"The total number of data points to work on: {daily_data.shape[0]}")
The total number of data points to work on: 606
It is challenging to forecast sales with this limited amount of data.
#saving the start and end dates separately
start_date=daily_data.index[0]
end_date=daily_data.index[-1]
from plotly.subplots import make_subplots
df1=master.groupby('purchase_week')['total_amount'].sum()
df2=master.groupby('purchase_dayofmonth')['total_amount'].sum()
fig=make_subplots(rows=1, cols=2, subplot_titles=("Revenue by weeks", "Revenue by days of month"))
fig.add_trace(go.Scatter(name='Weekly', x=df1.index, y=df1.values), row=1, col=1 )
fig.add_trace(go.Scatter(name='Monthly', x=df2.index, y=df2.values), row=1, col=2 )
fig.update_layout(height=500, width=1000,
title_text="Understanding patterns of revenue earned at weekly and monthly level")
fig.show()
Let us see which categories have the highest sales amount.
#the categories with the highest sales amount
df=master.groupby('product_category_name_english')['total_amount'].sum().sort_values(ascending=False)
fig = px.bar(df, x= df.index,
y=df.values,
labels={'y':'Sales amount'},
title='Product Category by sales amount',
width=1000, height=800 )
fig.update_xaxes(tickangle= -90)
fig.show()
We will be decomposing the time series using additive decomposition so that we can observe the underlying trend, seasonality and residuals.
Additive Decomposition : $y_t = Trend_t + Seasonal_t + Residual_t$
# decompose the time series
decomposition = tsa.seasonal_decompose(daily_data, model='additive')
#saving copy to new datafrme
daily_df=daily_data.copy()
# add the decomposition data
daily_df['Trend'] = decomposition.trend
daily_df['Seasonal'] = decomposition.seasonal
daily_df['Residual'] = decomposition.resid
#plotting the actual and decomposed componenets of time series
cols = ["total_amount","Trend", "Seasonal", "Residual"]
fig = make_subplots(rows=4, cols=1, subplot_titles=cols)
for i, col in enumerate(cols):
fig.add_trace(
go.Scatter(x=daily_df.index, y=daily_df[col]),
row=i+1,
col=1
)
fig.update_layout(height=800, width=1000, showlegend=False)
fig.show()
plt.subplots(figsize=(10,5))
# sns.set_theme(style= 'darkgrid' )
plt.title('Pattern of revenue earned at week level')
df_week_check=daily_data.copy()
df_week_check['Weekday']= df_week_check.index.weekday
week_day=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df_week_check['Weekday']=df_week_check['Weekday'].apply(lambda x: week_day[x])
sns.boxplot(data=df_week_check, x='Weekday', y='total_amount')
<AxesSubplot: title={'center': 'Pattern of revenue earned at week level'}, xlabel='Weekday', ylabel='total_amount'>
We can now proceed with checking the stationarity of the time series before modelling.
Forecasting is built on the assumption that historical data is representative of the future. It is important for a time series to be stationary; if it is not, the statistical properties of the data change over time and the series becomes hard to forecast.
If the mean and variance of the time series are constant over time, the series is stationary. We will plot the rolling mean and rolling standard deviation to visually check for stationarity.
https://www.linkedin.com/pulse/qb4-bollinger-bands-rolling-mean-standard-deviation-alan-mcdowell/
#plotting rolling mean and standard deviation.
# Get Things Rolling
roll_mean = daily_df['total_amount'].rolling(window=7).mean()
roll_std = daily_df['total_amount'].rolling(window=7).std()
# Figure
fig, ax = plt.subplots(figsize=(15,7), facecolor='w')
ax.plot(daily_df['total_amount'], label='Original')
ax.plot(roll_mean, label='Rolling Mean')
ax.plot(roll_std, label='Rolling STD')
# Legend & Grid
ax.legend(loc='upper right')
plt.grid(linestyle=":", color='grey')
plt.show()
The ADF test is used to determine the presence of a unit root in the series; the presence of a unit root means the time series is non-stationary. Moreover, the number of unit roots in the series corresponds to the number of differencing operations required to make the series stationary. The null and alternate hypotheses of this test are:
$H_0$ : The series has a unit root.
$H_a$: The series has no unit root.
If we fail to reject the null hypothesis (the p-value obtained is greater than the significance level, say 0.05), the test provides evidence that the series is non-stationary.
KPSS is another test for checking the stationarity of a time series. The null and alternate hypotheses of the KPSS test are the opposite of those of the ADF test.
$H_0$ : The process is trend stationary.
$H_a$: The series has a unit root (series is not stationary).
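The two tests are usually read together. Following the four cases described in the statsmodels documentation, here is a small hypothetical helper (the function name and phrasing are our own, not part of statsmodels) that combines the two p-values:

```python
def interpret_stationarity(adf_p: float, kpss_p: float, alpha: float = 0.05) -> str:
    """Combine ADF and KPSS p-values into one of the four cases
    described in the statsmodels stationarity documentation."""
    adf_stationary = adf_p < alpha        # ADF rejects its unit-root null
    kpss_stationary = kpss_p >= alpha     # KPSS fails to reject its stationarity null
    if adf_stationary and kpss_stationary:
        return "stationary"
    if not adf_stationary and not kpss_stationary:
        return "non-stationary"
    if adf_stationary and not kpss_stationary:
        return "difference stationary (apply differencing)"
    return "trend stationary (remove the trend)"

# ADF p ~ 0.006 with KPSS p ~ 0.01 (as we will see for total_amount)
print(interpret_stationarity(0.006, 0.01))
```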
def perform_adf_test(df) -> None:
"""
Augmented Dickey Fuller Test
- The null hypothesis for this test is that there is a unit root.
- The alternate hypothesis is that there is no unit root in the series.
---
Args:
df (pd.DataFrame): Dataframe contains the timeseries data
Returns: None
"""
adf_stat, p_value, n_lags, n_observ, crit_vals, icbest = adfuller(df)
print('\nAugmented Dickey Fuller Test')
print('---'*15)
print('ADF Statistic: %f' % adf_stat)
print('p-value: %f' % p_value)
print(f'Number of lags used: {n_lags}')
print(f'Number of observations used: {n_observ}')
print(f'T values corresponding to adfuller test:')
for key, value in crit_vals.items():
print(key, value)
def perform_kpss_test(df) -> None:
"""
Kwiatkowski-Phillips-Schmidt-Shin test for stationary.
- The null hypothesis for the test is that the data is stationary.
- The alternate hypothesis for the test is that the data is not stationary.
---
Args:
df (pd.DataFrame): Dataframe that contains the timeseries data
Returns: None
"""
kpss_stat, p_value, n_lags, crit_vals = kpss(df, nlags='auto', store=False)
print('\nKwiatkowski-Phillips-Schmidt-Shin test')
print('---'*15)
print('KPSS Statistic: %f' % kpss_stat)
print('p-value: %f' % p_value)
print(f'Number of lags used: {n_lags}')
print(f'Critical values of KPSS test:')
for key, value in crit_vals.items():
print(key, value)
print("ADF and KPSS test on original total_amount")
print("******************************************")
perform_adf_test(daily_df['total_amount'])
perform_kpss_test(daily_df['total_amount'])
ADF and KPSS test on original total_amount ****************************************** Augmented Dickey Fuller Test --------------------------------------------- ADF Statistic: -3.575418 p-value: 0.006249 Number of lags used: 8 Number of observations used: 597 T values corresponding to adfuller test: 1% -3.4413510722333087 5% -2.8663934413235266 10% -2.5693547658168003 Kwiatkowski-Phillips-Schmidt-Shin test --------------------------------------------- KPSS Statistic: 2.703197 p-value: 0.010000 Number of lags used: 15 Critical values of KPSS test: 10% 0.347 5% 0.463 2.5% 0.574 1% 0.739
ADF & KPSS Results on Original sales data
Based on these results we can conclude that:
Taking reference from the statsmodels documentation, which says that if:
KPSS indicates non-stationarity and ADF indicates stationarity - The series is difference stationary. Differencing is to be used to make series stationary. The differenced series is checked for stationarity.
We will proceed with differencing the time series and recheck for stationarity.
We start by differencing the series against the previous day's observation.
#differencing with previous day
daily_df["day_difference"] = daily_df["total_amount"].diff(1)
print(" ADF and KPSS test on differenced data ")
print("******************************************")
perform_adf_test(daily_df["day_difference"].dropna())
perform_kpss_test(daily_df["day_difference"].dropna())
ADF and KPSS test on differenced data ****************************************** Augmented Dickey Fuller Test --------------------------------------------- ADF Statistic: -9.426039 p-value: 0.000000 Number of lags used: 12 Number of observations used: 592 T values corresponding to adfuller test: 1% -3.441444394224128 5% -2.8664345376276454 10% -2.569376663737217 Kwiatkowski-Phillips-Schmidt-Shin test --------------------------------------------- KPSS Statistic: 0.108792 p-value: 0.100000 Number of lags used: 29 Critical values of KPSS test: 10% 0.347 5% 0.463 2.5% 0.574 1% 0.739
ADF & KPSS Results on differenced data
Based on these results we can conclude that:
Both tests conclude that - The series is stationary.
We can proceed with modelling.
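One point worth noting before modelling: a model fit on differenced data must have its forecasts inverted back to the original scale (SARIMAX does this internally through its d and D orders). A minimal sketch on hypothetical toy numbers of how a first difference is undone with a cumulative sum:

```python
import pandas as pd

# Toy series standing in for total_amount (values are made up)
y = pd.Series([100.0, 120.0, 90.0, 150.0])

# First difference: what .diff(1) produced for day_difference above
y_diff = y.diff(1)

# Invert: cumulative sum of the differences added back to the first value
y_rebuilt = y_diff.cumsum().fillna(0) + y.iloc[0]
print(y_rebuilt.tolist())  # [100.0, 120.0, 90.0, 150.0]
```

This is why it is safe to choose differencing orders from the stationarity tests: no information is lost, only re-expressed.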
We want to create some functions that we will reuse throughout the notebook. We will start with a train and test split.
We will split the series into train and test sets. We will not carve out a separate validation set because we have limited data.
def train_test_split(df, train_end, test_end):
"""
Splits the input dataframe into train and test sets by date (roughly 80% / 20%).
---
Args:
df : dataframe to split, with a datetime index.
train_end: end date of the train set (inclusive); a datetime or string of format YYYY-MM-DD.
test_end: end date of the test set (the test set runs from train_end to the end of the data).
Returns:
train_df (pd.DataFrame): Train Dataframe
test_df (pd.DataFrame): Test Dataframe
"""
train_set = df[df.index <= train_end]
test_set = df[df.index > train_end]
return train_set, test_set
train_end = '2018-4-30'
test_end = '2018-8-29'
train_df, test_df = train_test_split(daily_data, train_end, test_end)
print(f'The Train data has time range :Shape {train_df.shape} | {train_df.index[0]} to {train_df.index[-1]}')
print(f'The Test data has time range :Shape {test_df.shape} | {test_df.index[0]} to {test_df.index[-1]}' )
The Train data has time range :Shape (485, 1) | 2017-01-01 00:00:00 to 2018-04-30 00:00:00 The Test data has time range :Shape (121, 1) | 2018-05-01 00:00:00 to 2018-08-29 00:00:00
def plot_forecast(train_set, test_set, fc_series:pd.Series) -> None:
"""
This function plots the train, test and forecast values.
---
Args:
train_set: training dataframe with datetime index and only one column y
test_set : test dataframe with datetime index and only one column y
fc_series: forecast series
Returns: None
"""
fig = go.Figure()
fig.add_trace(go.Scatter(x=train_set.index, y=train_set.values, mode='lines', name="Train"))
fig.add_trace(go.Scatter(x=test_set.index, y=test_set.values, mode='lines', name="Test"))
fig.add_trace(go.Scatter(x=fc_series.index, y=fc_series.values, mode='lines', name="Forecast"))
fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(
yaxis_title="Revenue amount",
xaxis_title="Date",
title="Daily Sales amount and forecast"
)
fig.show()
def plot_test_predictions(test_set, predictions) -> None:
"""
This functions plots test set vs predicted values.
---
Args:
test_set : test dataframe with datetime index and only one column y
predictions: predicted values (array or list)
Returns: None
"""
test_set=pd.Series(test_set, index= test_set.index)
predictions=pd.Series(predictions, index= test_set.index)
fig = go.Figure()
fig.add_trace(go.Scatter(x=test_set.index, y=test_set.values, mode='lines', name="Test"))
fig.add_trace(go.Scatter(x=predictions.index, y=predictions.values, mode='lines', name="Predictions"))
fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(
yaxis_title="Revenue amount for test and predicted values",
xaxis_title="Date",
title="Daily revenue amount"
)
fig.show()
Throughout this notebook we will use the following two functions to evaluate model performance: one for MAPE and one for RMSE. If Y is the actual value and Predictions are the predicted values for n observations, then:
MAPE (Mean Absolute Percentage Error): It is a simple average of absolute percentage errors. It is calculated by
$$ \frac{1}{n} \sum_{i=1}^{n} {| \frac{Y_{actual_i} - Predictions_{i}}{Y_{actual_i}} |} \times{100} $$

RMSE (Root Mean Squared Error): It is the square root of the average of the squared differences between the original and predicted values in the data set.

$$ \sqrt{\frac{1}{n} \sum_{i=1}^{n} {{(Y_{actual_i} - Predictions_{i})}^2 }} $$

def mape_metrics(test_set, predicted) -> float:
"""
This function calculates the MAPE.
---
Args:
test_set (pd.Series): test set filtered series with y
predicted (pd.Series): predicted series
Returns: float MAPE percentage
"""
# Calculate the MAPE value and return
mape_result=round(np.mean(np.abs((test_set - predicted) / test_set)) * 100, 2)
return mape_result
def rmse_metrics(test_set, predicted) -> float:
"""
This function calculates the RMSE.
---
Args:
test_set (pd.Series): test set filtered series with y
predicted (pd.Series): predicted series
Returns: float RMSE
"""
# Calculate the RMSE value and return
return round(np.sqrt(np.mean((test_set - predicted)**2)),2)
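A quick sanity check of both formulas on hand-picked toy numbers (this cell mirrors the metric definitions above; the values are made up):

```python
import numpy as np
import pandas as pd

actual = pd.Series([100.0, 200.0])
predicted = pd.Series([110.0, 190.0])

# MAPE: mean of |10/100| and |10/200| -> mean(10%, 5%) = 7.5%
mape = round(np.mean(np.abs((actual - predicted) / actual)) * 100, 2)

# RMSE: sqrt(mean(10^2, 10^2)) = 10.0
rmse = round(np.sqrt(np.mean((actual - predicted) ** 2)), 2)

print(mape, rmse)  # 7.5 10.0
```

Note that MAPE is undefined when an actual value is zero, which is one reason we dropped the zero-sales days before modelling.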
We will start with a SARIMA model to account for the seasonality in our data. SARIMA (Seasonal Autoregressive Integrated Moving Average) explicitly supports univariate time series with a seasonal component. Before modelling, we need a basic understanding of which orders to choose for the autoregressive and moving average terms. We will plot the ACF and PACF to find them.
ACF : the autocorrelation function, which describes the correlation between the original and lagged series. PACF : the partial autocorrelation function, which is like the ACF but removes the intermediary effects of shorter lags, leaving only the direct effect visible.
def plot_acf_pacf(df, acf_lags: int, pacf_lags: int) -> None:
"""
This function plots the Autocorrelation and Partial Autocorrelation lags.
---
Args:
df (pd.DataFrame): Dataframe contains the order count and dates.
acf_lags (int): Number of ACF lags
pacf_lags (int): Number of PACF lags
Returns: None
"""
# Figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,9), facecolor='w')
# ACF & PACF
plot_acf(df, ax=ax1, lags=acf_lags)
plot_pacf(df, ax=ax2, lags=pacf_lags, method='ywm')
# Labels
ax1.set_title(f"Autocorrelation {df.name}", fontsize=15, pad=10)
ax1.set_ylabel("Sales amount", fontsize=12)
ax1.set_xlabel("Lags (Days)", fontsize=12)
ax2.set_title(f"Partial Autocorrelation {df.name}", fontsize=15, pad=10)
ax2.set_ylabel("Sales amount", fontsize=12)
ax2.set_xlabel("Lags (Days)", fontsize=12)
# Legend & Grid
ax1.grid(linestyle=":", color='grey')
ax2.grid(linestyle=":", color='grey')
plt.show()
#plotting the ACF and PACF plot for original series
plot_acf_pacf(daily_df['total_amount'], acf_lags=56, pacf_lags= 56)
ACF plot:
PACF plot:
We will plot the ACF and PACF after double differencing, i.e., applying a seasonal difference (lag 7) on top of the day-over-day difference.
#double differencing the column total_amount
daily_df['double_difference'] = daily_df['day_difference'].diff(7)
#plotting the ACF and PACF plot for double differenced series
plot_acf_pacf(daily_df['double_difference'].dropna(), acf_lags=56, pacf_lags= 56)
It is a little difficult to tell, simply from a time plot, what values of p and q are appropriate for the data, but we will try to find suitable orders.
ACF plot:
PACF plot:
The SARIMA model is specified as:
$$SARIMA(p, d, q) \times (P, D, Q)_s$$
Where:
$\;\;\;\;$ $p, d, q$ : the non-seasonal autoregressive, differencing and moving average orders
$\;\;\;\;$ $P, D, Q$ : the seasonal autoregressive, differencing and moving average orders
$\;\;\;\;$ $s$ : the length of the seasonal cycle (7 for weekly seasonality in daily data)
# Set Hyper-parameters
p, d, q = 1, 1, 1
P, D, Q = 0, 1, 1
s = 7
# Fit SARIMA
sarima_model = SARIMAX(train_df['total_amount'], order=(p, d, q), seasonal_order=(P, D, Q, s))
sarima_model_fit = sarima_model.fit(disp=0)
print(sarima_model_fit.summary())
SARIMAX Results
=========================================================================================
Dep. Variable: total_amount No. Observations: 485
Model: SARIMAX(1, 1, 1)x(0, 1, 1, 7) Log Likelihood -5042.977
Date: Wed, 15 Feb 2023 AIC 10093.954
Time: 08:55:12 BIC 10110.624
Sample: 01-01-2017 HQIC 10100.509
- 04-30-2018
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.2609 0.044 5.920 0.000 0.175 0.347
ma.L1 -0.8489 0.032 -26.659 0.000 -0.911 -0.787
ma.S.L7 -0.9525 0.020 -48.088 0.000 -0.991 -0.914
sigma2 8.693e+07 1.58e-10 5.5e+17 0.000 8.69e+07 8.69e+07
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 279160.07
Prob(Q): 0.93 Prob(JB): 0.00
Heteroskedasticity (H): 12.95 Skew: 8.05
Prob(H) (two-sided): 0.00 Kurtosis: 120.42
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.76e+32. Standard errors may be unstable.
# Plot diagnostics
sarima_model_fit.plot_diagnostics(figsize=(16,7))
plt.show()
Let's test the model on our test set:
# defining prediction period
pred_start_date = test_df.index[0]
pred_end_date = test_df.index[-1]
sarima_predictions = sarima_model_fit.predict(start=pred_start_date, end=pred_end_date)
sarima_residuals = test_df['total_amount'] - sarima_predictions
# Get evaluation data
sarima_root_mean_squared_error = rmse_metrics(test_df['total_amount'], sarima_predictions)
sarima_mape_error = mape_metrics(test_df['total_amount'], sarima_predictions)
print(f'Root Mean Squared Error | RMSE: {sarima_root_mean_squared_error}')
print(f'Mean Absolute Percentage Error | MAPE: {sarima_mape_error}')
Root Mean Squared Error | RMSE: 13810.6 Mean Absolute Percentage Error | MAPE: 68.99
We obtain a MAPE of 68.99 % and an RMSE of 13810.6.
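For reference, the `rmse_metrics` and `mape_metrics` helpers were defined back in section 4.3; a minimal numpy sketch of what they presumably compute:

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error: penalises large misses quadratically."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent (undefined where actual == 0)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

print(rmse([100, 200, 300], [110, 210, 310]))            # 10.0
print(round(mape([100, 200, 300], [110, 210, 310]), 2))  # 6.11
```

Note that MAPE is inflated when actual values are small, which partly explains the large percentages seen throughout this notebook given the near-zero sales days early in the series.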
plot_test_predictions(test_df['total_amount'], sarima_predictions)
Our baseline model is able to capture the seasonality and trend components, but it does not pick up the day-to-day variations.
We will try to forecast sales for the next 180 days. The first 121 days are covered by our test data, and we will see what the model forecasts for the roughly two months beyond that.
# Forecast Window
days = 180
sarima_forecast = sarima_model_fit.forecast(days)
sarima_forecast_series = pd.Series(sarima_forecast, index=sarima_forecast.index)
# Since negative orders are not possible we can trim them.
sarima_forecast_series[sarima_forecast_series < 0] = 0
plot_forecast(train_df['total_amount'], test_df['total_amount'], sarima_forecast_series)
Let us include the exogenous variable holiday to improve our model.
#reading the data from holiday file
holiday=pd.read_csv('data_cleaned/holiday.csv', index_col=0)
#checking the dataframe info
holiday.info()
<class 'pandas.core.frame.DataFrame'>
Index: 36 entries, 2017-01-01 to 2018-12-25
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   is_holiday  36 non-null     float64
dtypes: float64(1)
memory usage: 576.0+ bytes
#converting index to date time format
holiday.index=pd.to_datetime(holiday.index)
#reading head
holiday.head()
| | is_holiday |
|---|---|
| 2017-01-01 | 1.0 |
| 2017-02-27 | 1.0 |
| 2017-02-28 | 1.0 |
| 2017-03-01 | 1.0 |
| 2017-04-14 | 1.0 |
Our holiday dataframe only has dates when there is a holiday but not all the dates. We can fill the missing dates and change the is_holiday column to int format.
#fill the missing dates
idx = pd.date_range('2017-01-01', '2018-12-31')
holiday = holiday.reindex(idx, fill_value=0)
#converting is_holiday column to int type
holiday['is_holiday']=holiday['is_holiday'].astype(int)
#checking the info
holiday.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 730 entries, 2017-01-01 to 2018-12-31
Freq: D
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   is_holiday  730 non-null    int32
dtypes: int32(1)
memory usage: 8.6 KB
Since our data shows the impact of Black Friday, we can add a few more dates, such as Valentine's Day and Black Friday, as holidays.
#adding some more days as holidays: Valentine's Day (Feb 14),
#Black Friday (Nov 24 2017 and Nov 23 2018)
holiday.loc[((holiday.index.day== 14) & (holiday.index.month==2)),:]=1
holiday.loc[((holiday.index== '2017-11-24') | (holiday.index== '2018-11-23') ),:]=1
#plot to show holidays
plt.figure(figsize=(100,10))
sns.heatmap(holiday.T, cmap="flare", xticklabels=[])
plt.show()
#filtering the holidays only up to the end of our test data period
holiday_df=holiday.loc[holiday.index<='2018-08-29']
#adding the holiday data to our new dataframe.
dfex=pd.concat([daily_data, holiday_df], axis=1)
#plotting daily data to get high level picture
fig = px.line(dfex, x=dfex.index, y='total_amount')
# axis labels and title
fig.update_layout(
yaxis_title="Total Revenue earned (Brazilian Real)",
legend_title="",
title="Daily Revenue from Jan 2017 to Aug 2018"
)
v_line=dfex.loc[dfex['is_holiday']==1].index
for idx in v_line:
fig.add_vline(idx,line_width=0.5, line_dash="dash", line_color="red")
# activate slider
fig.update_xaxes(rangeslider_visible=True)
fig.show()
We can split the dataframe into train and test sets so that we can do some modelling on it.
#splitting into test and train
train_dfex, test_dfex = train_test_split(dfex, train_end, test_end)
#instantiating order variables
p = range(1, 5)
d = range(1,2)
q = range(1, 3)
s = 7
P=range(0,1)
D=range(1,2)
Q=range(1,3)
pdq = list(itertools.product(p, d, q))
seasonal_pdq=list(itertools.product(P, D, Q))
seasonal_PDQS= [(x[0], x[1], x[2], s) for x in seasonal_pdq]
def grid_search_sarimax(train_set, test_set) -> pd.DataFrame:
"""
This function performs a grid search over SARIMAX model hyper-parameters.
---
Args:
train_set (pd.DataFrame): Training set used for fitting each candidate model.
test_set (pd.DataFrame): Test set used for prediction and evaluation.
Returns: pd.DataFrame summarising AIC, BIC, MSE, SSE, RMSE and MAPE per model.
"""
# Supress UserWarnings
warnings.simplefilter('ignore', category=UserWarning)
#instantiating variables
pred_start_date = test_set.index[0]
pred_end_date = test_set.index[-1]
summary=pd.DataFrame(columns=['Model', 'AIC', 'BIC', 'MSE', 'SSE', 'RMSE', 'MAPE'])
data={}
# Grid Search
for order in pdq:
for seasonal_order in seasonal_PDQS:
model = SARIMAX(train_set['total_amount'],
order=order,
seasonal_order=seasonal_order,
exog=(train_set[['is_holiday']]))
results = model.fit(disp=0)
predictions = results.predict(start=pred_start_date, end=pred_end_date,
exog=(test_set[['is_holiday']]))
data= { "Model":f'{order}x{seasonal_order}',
"AIC":results.aic,
"BIC":results.bic,
"MSE":results.mse,
"SSE":results.sse,
"RMSE": rmse_metrics(test_set['total_amount'], predictions),
"MAPE": mape_metrics(test_set['total_amount'], predictions) }
summary=pd.concat([summary, pd.DataFrame(data, columns=summary.columns, index=[1])], ignore_index=True)
return summary
grid_search_sarimax(train_dfex, test_dfex)
| | Model | AIC | BIC | MSE | SSE | RMSE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | (1, 1, 1)x(0, 1, 1, 7) | 10098.803428 | 10119.641010 | 8.579247e+07 | 4.160935e+10 | 16149.51 | 81.59 |
| 1 | (1, 1, 1)x(0, 1, 2, 7) | 10098.205815 | 10123.210914 | 8.590413e+07 | 4.166351e+10 | 14130.02 | 70.50 |
| 2 | (1, 1, 2)x(0, 1, 1, 7) | 10186.915064 | 10211.920163 | 8.660065e+07 | 4.200131e+10 | 13312.72 | 65.66 |
| 3 | (1, 1, 2)x(0, 1, 2, 7) | 10188.869615 | 10218.042231 | 8.654337e+07 | 4.197354e+10 | 13383.17 | 66.10 |
| 4 | (2, 1, 1)x(0, 1, 1, 7) | 10098.721890 | 10123.726989 | 8.607758e+07 | 4.174762e+10 | 13809.34 | 68.67 |
| 5 | (2, 1, 1)x(0, 1, 2, 7) | 10186.651548 | 10215.824164 | 8.653358e+07 | 4.196879e+10 | 13448.51 | 66.51 |
| 6 | (2, 1, 2)x(0, 1, 1, 7) | 10086.691009 | 10115.863625 | 8.314456e+07 | 4.032511e+10 | 16061.22 | 81.02 |
| 7 | (2, 1, 2)x(0, 1, 2, 7) | 10184.337084 | 10217.677216 | 8.371883e+07 | 4.060363e+10 | 15112.30 | 75.99 |
| 8 | (3, 1, 1)x(0, 1, 1, 7) | 10182.809279 | 10211.981894 | 8.421900e+07 | 4.084622e+10 | 14936.32 | 75.13 |
| 9 | (3, 1, 1)x(0, 1, 2, 7) | 10184.273723 | 10217.613855 | 8.391110e+07 | 4.069689e+10 | 15021.04 | 75.55 |
| 10 | (3, 1, 2)x(0, 1, 1, 7) | 10148.537015 | 10181.877146 | 8.832691e+07 | 4.283855e+10 | 16411.13 | 83.02 |
| 11 | (3, 1, 2)x(0, 1, 2, 7) | 10183.265203 | 10220.772851 | 8.332192e+07 | 4.041113e+10 | 15102.15 | 75.95 |
| 12 | (4, 1, 1)x(0, 1, 1, 7) | 10181.694694 | 10215.034826 | 8.339000e+07 | 4.044415e+10 | 15026.73 | 75.58 |
| 13 | (4, 1, 1)x(0, 1, 2, 7) | 10183.326100 | 10220.833749 | 8.316294e+07 | 4.033403e+10 | 15083.97 | 75.86 |
| 14 | (4, 1, 2)x(0, 1, 1, 7) | 10175.171123 | 10212.678772 | 8.319527e+07 | 4.034971e+10 | 15159.34 | 76.28 |
| 15 | (4, 1, 2)x(0, 1, 2, 7) | 10176.405676 | 10218.080841 | 8.288848e+07 | 4.020091e+10 | 15167.14 | 76.28 |
The SARIMA order (1,1,2) x (0,1,1,7) gives the minimum MAPE and RMSE on the test data, so we will choose this model over our baseline.
# Set Hyper-parameters
p, d, q = 1, 1, 2
P, D, Q = 0, 1, 1
s = 7
# Fit SARIMA
sarimax_model = SARIMAX(train_dfex['total_amount'],
order=(p, d, q),
seasonal_order=(P, D, Q, s),
exog=(train_dfex[['is_holiday']]))
sarimax_model_fit = sarimax_model.fit(disp=0)
print(sarimax_model_fit.summary())
SARIMAX Results
===========================================================================================
Dep. Variable: total_amount No. Observations: 485
Model: SARIMAX(1, 1, 2)x(0, 1, [1], 7) Log Likelihood -5087.458
Date: Wed, 15 Feb 2023 AIC 10186.915
Time: 08:55:41 BIC 10211.920
Sample: 01-01-2017 HQIC 10196.747
- 04-30-2018
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
is_holiday 4390.8251 1903.409 2.307 0.021 660.212 8121.439
ar.L1 0.0746 0.527 0.142 0.887 -0.959 1.108
ma.L1 -0.6470 0.534 -1.211 0.226 -1.694 0.400
ma.L2 -0.1379 0.409 -0.337 0.736 -0.939 0.664
ma.S.L7 -0.9281 0.039 -23.640 0.000 -1.005 -0.851
sigma2 1.719e+08 0.583 2.95e+08 0.000 1.72e+08 1.72e+08
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 200524.31
Prob(Q): 0.99 Prob(JB): 0.00
Heteroskedasticity (H): 11.31 Skew: 7.11
Prob(H) (two-sided): 0.00 Kurtosis: 102.43
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 2.94e+22. Standard errors may be unstable.
#predicting
pred_start_date = test_df.index[0]
pred_end_date = test_df.index[-1]
sarimax_predictions = sarimax_model_fit.predict(start=pred_start_date, end=pred_end_date, exog=(test_dfex[['is_holiday']]) )
sarimax_root_mean_squared_error = rmse_metrics(test_dfex['total_amount'], sarimax_predictions)
sarimax_mape=mape_metrics(test_dfex['total_amount'], sarimax_predictions)
print(f'Root Mean Squared Error | RMSE: {sarimax_root_mean_squared_error}')
print(f'Mean Absolute Percentage Error | MAPE: {sarimax_mape}')
Root Mean Squared Error | RMSE: 13312.72 Mean Absolute Percentage Error | MAPE: 65.66
plot_test_predictions(test_dfex['total_amount'], sarimax_predictions)
# Forecast Window
exog=holiday.loc[(holiday.index> '2018-04-30') & (holiday.index<= '2018-10-27') ]
exog_df=exog.copy()
days = 180
exog_param=(exog_df[['is_holiday']])
sarimax_forecast = sarimax_model_fit.forecast(days, exog=exog_param)
sarimax_forecast_series = pd.Series(sarimax_forecast, index=sarimax_forecast.index)
plot_forecast(train_dfex['total_amount'], test_dfex['total_amount'], sarimax_forecast_series)
FB Prophet is a forecasting package in Python that was developed by Facebook’s data science research team. The goal of the package is to give business users a powerful and easy-to-use tool to help forecast business results without needing to be an expert in time series analysis. We will apply this model and see how it performs.
Facebook Prophet needs data in a specific format to be able to process it. The input to Prophet is always a dataframe with two columns: ds and y. The ds (datestamp) column should be of a format expected by pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. The y column must be numeric and represents the measurement we want to forecast, in our case total_amount.
#preparing the dataframe for fbProphet
prophet_df=dfex['total_amount'].reset_index()
prophet_df.rename(columns={"index": "ds", "total_amount": "y"}, inplace=True)
#using our original train_df and test_df we will convert them into the Prophet train and test sets
prophet_train = train_df["total_amount"].reset_index()
prophet_train.rename(columns={"order_purchase_timestamp": "ds", "total_amount": "y"}, inplace=True)
prophet_test = test_df["total_amount"].reset_index()
prophet_test.rename(columns={"order_purchase_timestamp": "ds", "total_amount": "y"}, inplace=True)
prophet_df.head()
| | ds | y |
|---|---|---|
| 0 | 2017-01-01 | 0.0 |
| 1 | 2017-01-02 | 0.0 |
| 2 | 2017-01-03 | 0.0 |
| 3 | 2017-01-04 | 0.0 |
| 4 | 2017-01-05 | 396.9 |
Since we observed that our data has a positive trend and seasonality, we will set growth='linear' and let the model find the appropriate seasonality by setting yearly_seasonality, daily_seasonality and weekly_seasonality to True.
#instantiate the model
fb_baseline = Prophet(growth='linear',
yearly_seasonality=True,
daily_seasonality=True,
weekly_seasonality=True)
fb_baseline.fit(prophet_train)
<fbprophet.forecaster.Prophet at 0x1d492326130>
#make predictions dataframe
future_base = fb_baseline.make_future_dataframe(periods=len(test_df), freq="D")
#make a forecast
forecast_base = fb_baseline.predict(future_base)
forecast_base[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
| | ds | yhat | yhat_lower | yhat_upper |
|---|---|---|---|---|
| 601 | 2018-08-25 | 38703.402238 | 27850.514400 | 49974.134276 |
| 602 | 2018-08-26 | 39793.914774 | 28853.592346 | 50549.915252 |
| 603 | 2018-08-27 | 47564.058526 | 37425.216632 | 58963.497141 |
| 604 | 2018-08-28 | 48007.781107 | 37422.322128 | 58954.139575 |
| 605 | 2018-08-29 | 46813.235399 | 35654.962406 | 58255.634010 |
forecast_base[-121:].reset_index()['yhat']
0 41595.478779
1 40191.984960
2 40231.082802
3 40782.308981
4 34046.678525
...
116 38703.402238
117 39793.914774
118 47564.058526
119 48007.781107
120 46813.235399
Name: yhat, Length: 121, dtype: float64
#evaluating on test set
fb_baseline_mape = mape_metrics(prophet_test['y'], forecast_base[-121:].reset_index()['yhat'] )
fb_baseline_rmse = rmse_metrics(prophet_test['y'], forecast_base[-121:].reset_index()['yhat'] )
print(f'Root Mean Squared Error | RMSE: {fb_baseline_rmse}')
print(f'Mean Absolute Percentage Error | MAPE: {fb_baseline_mape}')
Root Mean Squared Error | RMSE: 14904.05 Mean Absolute Percentage Error | MAPE: 75.28
from fbprophet.plot import plot_plotly
fig = plot_plotly(fb_baseline, forecast_base)
fig.update_layout(
title="Daily Sales amount",
xaxis_title="Date",
yaxis_title="Revenue amount"
)
fig.show()
#preparing the holiday dataframe as per prophet
holiday_df_fb = pd.DataFrame({
'holiday': 'Brazil holidays',
'ds': pd.to_datetime(holiday_df.loc[holiday_df['is_holiday']==1].index)})
fb_baseline_holi = Prophet(growth='linear',
holidays=holiday_df_fb,
yearly_seasonality=True,
daily_seasonality=True,
weekly_seasonality=True).add_country_holidays(country_name='BR')
fb_baseline_holi.fit(prophet_train)
<fbprophet.forecaster.Prophet at 0x1d48b492b80>
#preparing the forecast dataframe
future_base_holi = fb_baseline_holi.make_future_dataframe(periods=len(test_df), freq="D")
#doing the actual forecast
forecast_base_holi = fb_baseline_holi.predict(future_base_holi)
forecast_base_holi[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
| | ds | yhat | yhat_lower | yhat_upper |
|---|---|---|---|---|
| 601 | 2018-08-25 | 39680.385242 | 27493.733309 | 50152.633038 |
| 602 | 2018-08-26 | 40618.882951 | 29738.338508 | 51236.224444 |
| 603 | 2018-08-27 | 48606.375757 | 37874.992661 | 59519.491629 |
| 604 | 2018-08-28 | 48832.434226 | 37915.910928 | 59524.345705 |
| 605 | 2018-08-29 | 47909.521241 | 36820.957799 | 59033.048757 |
#evaluating on test set
fb_baseline_holi_mape=mape_metrics(prophet_test['y'], forecast_base_holi[-121:].reset_index()['yhat'] )
fb_baseline_holi_rmse=rmse_metrics(prophet_test['y'], forecast_base_holi[-121:].reset_index()['yhat'] )
print(f'Root Mean Squared Error | RMSE: {fb_baseline_holi_rmse}')
print(f'Mean Absolute Percentage Error | MAPE: {fb_baseline_holi_mape}')
Root Mean Squared Error | RMSE: 15393.64 Mean Absolute Percentage Error | MAPE: 77.88
We will tune the following parameters: seasonality_mode, changepoint_prior_scale and holidays_prior_scale.
Let us discuss these parameters.
seasonality_mode: by default Prophet selects additive, but we want to give it the option to try multiplicative as well.
holidays_prior_scale: determines how much of an effect holidays have on the predictions. The default is 10; it can be tuned over a range like [0.01, 10].
changepoint_prior_scale: changepoints are the points in the data where there are sudden and abrupt changes in the trend. This parameter determines the flexibility of the trend, in particular how much the trend changes at the changepoints. The default of 0.05 works for many time series, but a range of [0.001, 0.5] is usually about right for tuning.
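Roughly, the two seasonality modes differ in how the seasonal and holiday components enter Prophet's forecast, with trend $g(t)$, seasonality $s(t)$, holidays $h(t)$ and noise $\varepsilon_t$:

$$\text{additive: } y(t) = g(t) + s(t) + h(t) + \varepsilon_t$$
$$\text{multiplicative: } y(t) = g(t)\,\bigl(1 + s(t) + h(t)\bigr) + \varepsilon_t$$

With multiplicative seasonality the size of the weekly swing grows with the trend, which can suit growing sales data like ours.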
from sklearn.model_selection import ParameterGrid
params_grid = {'seasonality_mode':('multiplicative','additive'),
'changepoint_prior_scale':[0.1, 0.4],
'holidays_prior_scale':[0.1,0.2,0.3,0.4,0.5]}
grid = ParameterGrid(params_grid)
cnt = 0
for p in grid:
cnt = cnt+1
print('Total Possible Models',cnt)
Total Possible Models 20
tunning_summary = pd.DataFrame(columns=['Param', 'RMSE', 'MAPE'])
import random
val=pd.DataFrame()
for p in grid:
# print(p)
random.seed(0)
train_model =Prophet(growth='linear',
changepoint_prior_scale = p['changepoint_prior_scale'],
holidays_prior_scale = p['holidays_prior_scale'],
seasonality_mode = p['seasonality_mode'],
seasonality_prior_scale=20,
weekly_seasonality=True,
daily_seasonality = True,
yearly_seasonality = True,
holidays=holiday_df_fb)
train_model.add_country_holidays(country_name='BR')
train_model.fit(prophet_train)
grid_forecast = train_model.make_future_dataframe(periods=121, freq='D',include_history = False)
grid_forecast = train_model.predict(grid_forecast)
val_df=grid_forecast[['ds','yhat']]
mapes=np.mean(np.abs((prophet_test['y']-val_df['yhat'])/prophet_test['y']))*100
rmses = np.sqrt(np.mean((prophet_test['y']-val_df['yhat'])**2))
# Find the best parameters
data= {"Param": f'{p}',
"RMSE": rmses,
"MAPE": mapes }
tunning_summary=pd.concat([tunning_summary, pd.DataFrame(data, columns=tunning_summary.columns, index=[1])], ignore_index=True)
tunning_summary
| | Param | RMSE | MAPE |
|---|---|---|---|
| 0 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 14323.127238 | 69.067258 |
| 1 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 13831.198684 | 68.977829 |
| 2 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 15889.884132 | 76.295847 |
| 3 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 13933.034091 | 69.529761 |
| 4 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 15910.722963 | 76.326488 |
| 5 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 13961.363212 | 69.692900 |
| 6 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 15976.943243 | 76.537357 |
| 7 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 13995.316724 | 69.893073 |
| 8 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 13712.466581 | 64.318053 |
| 9 | {'changepoint_prior_scale': 0.1, 'holidays_pri... | 14014.377104 | 69.986390 |
| 10 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 15630.137007 | 74.832688 |
| 11 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 13140.371293 | 64.690013 |
| 12 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 13052.161061 | 60.729572 |
| 13 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 13153.335184 | 64.719378 |
| 14 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 12283.063672 | 55.248070 |
| 15 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 13158.441265 | 64.740393 |
| 16 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 12151.948994 | 53.336248 |
| 17 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 13175.698072 | 64.846211 |
| 18 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 12142.685263 | 51.453872 |
| 19 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 13186.650847 | 64.915281 |
#filtering the row with minimum MAPE
tunning_summary.loc[tunning_summary['MAPE']==tunning_summary['MAPE'].min(),:]
| | Param | RMSE | MAPE |
|---|---|---|---|
| 18 | {'changepoint_prior_scale': 0.4, 'holidays_pri... | 12142.685263 | 51.453872 |
#getting the parameter values
tunning_summary.iloc[18, :]['Param']
"{'changepoint_prior_scale': 0.4, 'holidays_prior_scale': 0.5, 'seasonality_mode': 'multiplicative'}"
#fitting the model on tuned parameters
fb_tuned = Prophet(growth='linear',
changepoint_prior_scale= 0.4,
holidays_prior_scale= 0.5,
seasonality_mode= 'multiplicative',
seasonality_prior_scale=20,
holidays=holiday_df_fb,
yearly_seasonality=True,
daily_seasonality=True,
weekly_seasonality=True).add_country_holidays(country_name='BR')
fb_tuned.fit(prophet_train)
<fbprophet.forecaster.Prophet at 0x1d495095cd0>
#make future dataframe
fb_tuned_future=fb_tuned.make_future_dataframe(periods=121, freq="D")
# ,include_history = False
#forecasting
forecast_tuned = fb_tuned.predict(fb_tuned_future)
forecast_tuned [['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
| | ds | yhat | yhat_lower | yhat_upper |
|---|---|---|---|---|
| 601 | 2018-08-25 | 21454.618193 | 11376.194819 | 32451.621645 |
| 602 | 2018-08-26 | 22754.149194 | 11899.356892 | 32786.712348 |
| 603 | 2018-08-27 | 35314.865637 | 24548.772837 | 46298.220030 |
| 604 | 2018-08-28 | 34959.709975 | 24416.002477 | 45298.598274 |
| 605 | 2018-08-29 | 33316.214377 | 23193.749826 | 44135.647453 |
#evaluating on test set
fb_tuned_mape=mape_metrics(prophet_test['y'], forecast_tuned[-121:].reset_index()['yhat'] )
fb_tuned_rmse=rmse_metrics(prophet_test['y'], forecast_tuned[-121:].reset_index()['yhat'] )
print(f'Root Mean Squared Error | RMSE: {fb_tuned_rmse}')
print(f'Mean Absolute Percentage Error | MAPE: {fb_tuned_mape}')
Root Mean Squared Error | RMSE: 12142.69 Mean Absolute Percentage Error | MAPE: 51.45
fig = plot_plotly(fb_tuned, forecast_tuned )
fig.update_layout(
title="Daily Sales amount",
xaxis_title="Date",
yaxis_title="Revenue amount"
)
fig.show()
# prophet_test['y'].mean()
# forecast_tuned['yhat'][-121:].mean()
| Model | RMSE | MAPE |
|---|---|---|
| SARIMA(1,1,1)(0,1,1)(7) | 13810.59 | 68.99 |
| SARIMAX(1,1,2)(0,1,1)(7) Including impact of holidays | 13312.72 | 65.66 |
| Baseline Prophet | 14904.05 | 75.28 |
| Baseline Prophet with holiday | 15393.64 | 77.88 |
| Tuned Prophet with holiday | 12142.69 | 51.45 |
XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling. We will try to implement this model.
For XGBoost we cannot feed a time series as such; we will have to extract datetime features from our order timestamp in order to predict total_amount.
We will create a function to extract features from the date. We will also add holiday as a feature.
#function to extract features from date
def create_features(dataframe, label=None):
df=dataframe.copy()
df['date'] = df.index
df['dayofweek'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['dayofyear'] = df['date'].dt.dayofyear
df['dayofmonth'] = df['date'].dt.day
df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
X = df[['dayofweek', 'quarter', 'month', 'year', 'dayofyear','dayofmonth', 'weekofyear']].copy()
if label:
y = df[label]
return X, y
return X
#creating dataframe with only required columns
daily_data_xgb=dfex[['total_amount', 'is_holiday']]
#Separating X and y
X, y= create_features(daily_data_xgb, label='total_amount')
X['is_holiday']=daily_data_xgb['is_holiday']
X.head()
| | dayofweek | quarter | month | year | dayofyear | dayofmonth | weekofyear | is_holiday |
|---|---|---|---|---|---|---|---|---|
| 2017-01-01 | 6 | 1 | 1 | 2017 | 1 | 1 | 52 | 1 |
| 2017-01-02 | 0 | 1 | 1 | 2017 | 2 | 2 | 1 | 0 |
| 2017-01-03 | 1 | 1 | 1 | 2017 | 3 | 3 | 1 | 0 |
| 2017-01-04 | 2 | 1 | 1 | 2017 | 4 | 4 | 1 | 0 |
| 2017-01-05 | 3 | 1 | 1 | 2017 | 5 | 5 | 1 | 0 |
#splitting the data into train and test
train_end = '2018-4-30'
test_end = '2018-8-29'
X_train, X_test = train_test_split(X, train_end, test_end)
y_train, y_test = train_test_split(y, train_end, test_end)
Let us apply a baseline XGBoost regression and then evaluate its performance.
#instantiate the XGBoost Model
reg = xgb.XGBRegressor(n_estimators=1000, max_depth=3)
reg.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
early_stopping_rounds=50,
verbose=True)
[0] validation_0-rmse:20091.22974 validation_1-rmse:25494.23244 [1] validation_0-rmse:15569.06877 validation_1-rmse:19173.80315 [2] validation_0-rmse:12561.60000 validation_1-rmse:15332.13670 [3] validation_0-rmse:10761.74494 validation_1-rmse:13457.44742 [4] validation_0-rmse:9153.27296 validation_1-rmse:12329.47857 [5] validation_0-rmse:8018.61914 validation_1-rmse:11726.32667 [6] validation_0-rmse:7526.28509 validation_1-rmse:11550.97415 [7] validation_0-rmse:6865.40326 validation_1-rmse:11500.50474 [8] validation_0-rmse:6378.20680 validation_1-rmse:11493.59336 [9] validation_0-rmse:5999.63606 validation_1-rmse:11537.69003 [10] validation_0-rmse:5824.62760 validation_1-rmse:11583.81280 [11] validation_0-rmse:5589.63328 validation_1-rmse:11600.04201 [12] validation_0-rmse:5387.86805 validation_1-rmse:11544.21935 [13] validation_0-rmse:5348.55424 validation_1-rmse:11567.96099 [14] validation_0-rmse:5165.61167 validation_1-rmse:11609.90550 [15] validation_0-rmse:5020.64075 validation_1-rmse:11621.18955 [16] validation_0-rmse:4960.49363 validation_1-rmse:11648.22786 [17] validation_0-rmse:4917.95579 validation_1-rmse:11661.42073 [18] validation_0-rmse:4864.95097 validation_1-rmse:11612.14894 [19] validation_0-rmse:4766.17509 validation_1-rmse:11612.73206 [20] validation_0-rmse:4741.14973 validation_1-rmse:11598.55214 [21] validation_0-rmse:4665.92004 validation_1-rmse:11601.79049 [22] validation_0-rmse:4588.43864 validation_1-rmse:11607.42071 [23] validation_0-rmse:4518.74967 validation_1-rmse:11592.33710 [24] validation_0-rmse:4494.93263 validation_1-rmse:11597.64204 [25] validation_0-rmse:4365.38962 validation_1-rmse:11593.78393 [26] validation_0-rmse:4349.07234 validation_1-rmse:11561.14553 [27] validation_0-rmse:4297.00537 validation_1-rmse:11562.03597 [28] validation_0-rmse:4262.70024 validation_1-rmse:11595.53519 [29] validation_0-rmse:4208.82143 validation_1-rmse:11573.89872 [30] validation_0-rmse:4173.03240 validation_1-rmse:11564.39727 [31] 
validation_0-rmse:4136.56835 validation_1-rmse:11608.18702 [32] validation_0-rmse:4105.81737 validation_1-rmse:11679.87197 [33] validation_0-rmse:4040.26708 validation_1-rmse:11672.73164 [34] validation_0-rmse:3978.29881 validation_1-rmse:11672.29589 [35] validation_0-rmse:3944.94785 validation_1-rmse:11670.91868 [36] validation_0-rmse:3903.84378 validation_1-rmse:11667.00166 [37] validation_0-rmse:3826.52600 validation_1-rmse:11663.81879 [38] validation_0-rmse:3777.20044 validation_1-rmse:11662.06124 [39] validation_0-rmse:3736.47090 validation_1-rmse:11665.76056 [40] validation_0-rmse:3703.70287 validation_1-rmse:11670.38346 [41] validation_0-rmse:3651.95240 validation_1-rmse:11666.90757 [42] validation_0-rmse:3644.95348 validation_1-rmse:11675.03502 [43] validation_0-rmse:3628.32189 validation_1-rmse:11675.67698 [44] validation_0-rmse:3602.89573 validation_1-rmse:11673.85435 [45] validation_0-rmse:3579.18469 validation_1-rmse:11677.71554 [46] validation_0-rmse:3574.58358 validation_1-rmse:11677.38926 [47] validation_0-rmse:3550.78685 validation_1-rmse:11686.87648 [48] validation_0-rmse:3515.18256 validation_1-rmse:11698.21647 [49] validation_0-rmse:3501.68779 validation_1-rmse:11703.51374 [50] validation_0-rmse:3485.54525 validation_1-rmse:11701.81938 [51] validation_0-rmse:3475.84810 validation_1-rmse:11704.13086 [52] validation_0-rmse:3461.98443 validation_1-rmse:11705.18792 [53] validation_0-rmse:3432.05624 validation_1-rmse:11707.28888 [54] validation_0-rmse:3406.63698 validation_1-rmse:11707.95335 [55] validation_0-rmse:3403.97322 validation_1-rmse:11707.69042 [56] validation_0-rmse:3352.98727 validation_1-rmse:11668.99701 [57] validation_0-rmse:3336.82736 validation_1-rmse:11671.72883 [58] validation_0-rmse:3311.60893 validation_1-rmse:11678.11015
XGBRegressor(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=3, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=1000, n_jobs=None, num_parallel_tree=None,
predictor=None, random_state=None, ...)
We can see how features impact the prediction of total_amount.
#plot feature importance
plot_importance(reg, height=0.9)
<AxesSubplot: title={'center': 'Feature importance'}, xlabel='F score', ylabel='Features'>
#evaluating on test set
predictions_xgb = reg.predict(X_test)
#calculating MAPE and RMSE
mape_xgb=mape_metrics(y_test, predictions_xgb )
rmse_xgb=rmse_metrics(y_test, predictions_xgb )
print(f'Root Mean Squared Error | RMSE: {rmse_xgb}')
print(f'Mean Absolute Percentage Error | MAPE: {mape_xgb}')
Root Mean Squared Error | RMSE: 11493.59 Mean Absolute Percentage Error | MAPE: 52.18
#if doing forecasting
start_date=datetime(2018,8,30)
end_date=start_date+pd.DateOffset(days=30)
full_range = pd.date_range(start=start_date, end=end_date, freq="D")
filtered_data= holiday.loc[(holiday.index >= start_date) & (holiday.index <= end_date)]
forecast_df=pd.DataFrame(data=filtered_data, index=full_range, columns=['is_holiday'])
#creating features for new dates
X_forcast=create_features(forecast_df)
X_forcast['is_holiday']=forecast_df['is_holiday']
#making forecast
xgb_forecast=reg.predict(X_forcast)
#creating dataframe for all the y values
y_train_df = pd.DataFrame(y_train, train_df.index, columns=['total_amount'] )
y_test_df = pd.DataFrame(y_test, test_df.index, columns=['total_amount'] )
xgb_forecast_df=pd.DataFrame(xgb_forecast, forecast_df.index, columns=['Forecast'] )
# predictions_xgb_df=pd.DataFrame(predictions_xgb, y_test_df.index, columns=['Predictions'] )
#plotting test and predictions
plot_test_predictions(y_test_df['total_amount'], predictions_xgb )
#plotting forecast
plot_forecast(y_train_df['total_amount'], y_test_df['total_amount'], xgb_forecast_df['Forecast'] )
Early stopping is a technique used to stop training when the loss on the validation dataset starts to increase (in the case of minimizing the loss). That is why, to train a model (any model, not only XGBoost), you need two separate datasets:
training data for model fitting, and validation data for loss monitoring and early stopping.
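The mechanism can be sketched as a simple patience loop (illustrative only; XGBoost implements this internally via the `early_stopping_rounds` argument used above):

```python
def train_with_early_stopping(losses, patience):
    """Return the round index at which training stops, given per-round
    validation losses: stop once the loss has failed to improve for
    `patience` consecutive rounds."""
    best_loss = float("inf")
    rounds_without_improvement = 0
    for round_idx, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return round_idx  # stop here; the best model was earlier
    return len(losses) - 1  # early stopping never triggered

# validation loss improves, then plateaus: with patience=2,
# training stops two rounds after the minimum at index 2
val_losses = [0.9, 0.7, 0.5, 0.6, 0.6, 0.4]
print(train_with_early_stopping(val_losses, patience=2))  # 4
```

In the baseline fit above, `early_stopping_rounds=50` played exactly this role: validation RMSE bottomed out around round 8 and training halted roughly 50 rounds later.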
num_estimators = [i for i in range(100, 1010, 100)]
depth = [i for i in range(3, 15, 2)]
learning = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
XGB_summary = pd.DataFrame(columns=['n_estimators', 'max_depth', 'learning_rate', 'MAPE', 'RMSE'])
for i in num_estimators:
    for j in depth:
        for k in learning:
            model_reg = xgb.XGBRegressor(n_estimators=i, max_depth=j, learning_rate=k)
            model_reg.fit(X_train, y_train)
            predictions_xgb = model_reg.predict(X_test)
            mape_xgb = np.mean(np.abs((y_test - predictions_xgb) / y_test)) * 100
            rmse_xgb = np.sqrt(np.mean((y_test - predictions_xgb) ** 2))
            data_param = {'n_estimators': i,
                          'max_depth': j,
                          'learning_rate': k,
                          'MAPE': mape_xgb,
                          'RMSE': rmse_xgb}
            XGB_summary = pd.concat([XGB_summary, pd.DataFrame(data_param, columns=XGB_summary.columns, index=[1])], ignore_index=True)
XGB_summary.sort_values(by ='MAPE', ascending=True).head(10)
| | n_estimators | max_depth | learning_rate | MAPE | RMSE |
|---|---|---|---|---|---|
| 62 | 200 | 11 | 0.010 | 47.583272 | 12349.605940 |
| 56 | 200 | 9 | 0.010 | 47.687938 | 12271.166307 |
| 68 | 200 | 13 | 0.010 | 47.884805 | 12237.428437 |
| 50 | 200 | 7 | 0.010 | 48.130152 | 11923.555634 |
| 38 | 200 | 3 | 0.010 | 48.907972 | 12070.128655 |
| 44 | 200 | 5 | 0.010 | 49.357951 | 12115.267006 |
| 2 | 100 | 3 | 0.010 | 50.211866 | 16374.092326 |
| 325 | 1000 | 3 | 0.001 | 50.265213 | 16415.906460 |
| 98 | 300 | 11 | 0.010 | 50.517951 | 11576.892123 |
| 20 | 100 | 9 | 0.010 | 50.592040 | 16444.078331 |
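As a design note, the triple-nested loop above can be flattened with `itertools.product`, which keeps the search over the same grid but in a single loop. A minimal sketch, where the hypothetical `score` function stands in for the real fit-and-evaluate step:

```python
from itertools import product

# Dummy scoring function standing in for "fit XGBRegressor, compute MAPE on the test set".
def score(n_estimators, max_depth, learning_rate):
    return abs(n_estimators - 200) + abs(max_depth - 11) + abs(learning_rate - 0.01)

num_estimators = range(100, 1010, 100)
depth = range(3, 15, 2)
learning = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]

# one flat loop over all 10 * 6 * 6 = 360 combinations
results = [
    {"n_estimators": i, "max_depth": j, "learning_rate": k, "MAPE": score(i, j, k)}
    for i, j, k in product(num_estimators, depth, learning)
]
best = min(results, key=lambda r: r["MAPE"])
print(best)  # → {'n_estimators': 200, 'max_depth': 11, 'learning_rate': 0.01, 'MAPE': 0.0}
```

The flat version also makes it easy to parallelize the combinations later if the grid grows.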
#instantiate the XGBoost Model
reg_tuned = xgb.XGBRegressor(n_estimators=200, max_depth=11, learning_rate=0.010)
reg_tuned.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
early_stopping_rounds=50,
verbose=True)
[0] validation_0-rmse:26442.40689 validation_1-rmse:34787.93310
[1] validation_0-rmse:26209.35688 validation_1-rmse:34463.03673
[2] validation_0-rmse:25978.79956 validation_1-rmse:34141.76996
...
[198] validation_0-rmse:6065.61090 validation_1-rmse:12373.72518
[199] validation_0-rmse:6029.12181 validation_1-rmse:12349.60594
XGBRegressor(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.01, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=11, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=200, n_jobs=None, num_parallel_tree=None,
predictor=None, random_state=None, ...)
#evaluating on the test set
predictions_xgb_tuned = reg_tuned.predict(X_test)
#calculating MAPE and RMSE
mape_xgb_tuned = mape_metrics(y_test, predictions_xgb_tuned)
rmse_xgb_tuned = rmse_metrics(y_test, predictions_xgb_tuned)
print(f'Root Mean Squared Error | RMSE: {rmse_xgb_tuned}')
print(f'Mean Absolute Percentage Error | MAPE: {mape_xgb_tuned}')
Root Mean Squared Error | RMSE: 12349.61
Mean Absolute Percentage Error | MAPE: 47.58
#plot feature importance
plot_importance(reg_tuned, height=0.9)
<AxesSubplot: title={'center': 'Feature importance'}, xlabel='F score', ylabel='Features'>
#plotting test and predictions
plot_test_predictions(y_test_df['total_amount'], predictions_xgb_tuned)
We can use a univariate LSTM to predict the next value in our time series. We will use one-step prediction, meaning we use only one previous observation to predict the next value, so we have a single feature. This one-step window is most useful when we only need to forecast the value for the next day; that is not quite our use case for e-commerce sales forecasting, but we still want to try this approach.
Like SARIMA, FB Prophet and XGBoost before it, the LSTM expects its input in a specific format: an array of shape (number of observations, number of timesteps, number of features).
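That layout can be illustrated on a toy array (a minimal sketch with made-up numbers, unrelated to our sales data):

```python
import numpy as np

# 5 observations of a univariate series, lookback window of 1
series = np.array([10.0, 12.0, 11.0, 15.0, 14.0])

# one-step windows: X[t] = series[t], y[t] = series[t + 1]
X = series[:-1].reshape(-1, 1)  # (samples, timesteps) = (4, 1)
y = series[1:]                  # (4,)

# the LSTM layer expects a third axis: (samples, timesteps, features)
X_lstm = X.reshape(X.shape[0], X.shape[1], 1)
print(X_lstm.shape)  # → (4, 1, 1)
```

With a window of 1 and one feature, the last two axes are both of length 1, which matches the `(484, 1, 1)` shape we get on the training data below.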
We also need to scale the data before feeding it to the LSTM model.
#scaling the data
sc = MinMaxScaler(feature_range = (0,1))
training_scaled = sc.fit_transform(train_df)
test_scaled = sc.transform(test_df)
fig, (ax0,ax1) = plt.subplots(nrows=1, ncols=2, figsize=(20,8))
ax0.set_title('Original Distributions')
ax1.set_title('Distributions after MinMax scaler')
ax0.plot(train_df['total_amount'])
ax1.plot(pd.DataFrame(training_scaled,columns=['total_amount'], index=train_df.index)['total_amount'])
[<matplotlib.lines.Line2D at 0x1d49266df40>]
#define the lookback function for creating y
def lookback(df, window):
    X = list()
    Y = list()
    for i in range(window, len(df)):
        X.append(df[i - window:i, 0])
        Y.append(df[i, 0])
    X, Y = np.array(X), np.array(Y)
    return X, Y
#creating X and y using lookback function
X_train_lstm, y_train_lstm = lookback(training_scaled, 1)
X_test_lstm, y_test_lstm = lookback(test_scaled, 1)
#checking shape of X
X_train_lstm.shape
(484, 1)
#reshaping X for LSTM
#number of observations, timesteps and features
X_train_lstm = np.reshape(X_train_lstm, (X_train_lstm.shape[0],X_train_lstm.shape[1],1))
X_test_lstm = np.reshape(X_test_lstm, (X_test_lstm.shape[0],X_test_lstm.shape[1],1))
X_train_lstm.shape
(484, 1, 1)
Let us apply the LSTM model with three layers: the first layer has 100 nodes, the second has 50 nodes, and the output layer has 1 node. We use the Keras EarlyStopping callback to prevent overfitting, and a batch size of 16.
# random seeds for reproducibility
np.random.seed(123)
tf.random.set_seed(123)
model = Sequential()
# The input of LSTM layer has a shape of (num_timesteps, num_features)
model.add(LSTM(100, return_sequences=True, activation='relu', input_shape=(X_train_lstm.shape[1], X_train_lstm.shape[2])))
model.add(Dropout(0.2))
model.add(LSTM(50, activation='relu' ))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()
history = model.fit(X_train_lstm, y_train_lstm, epochs=50, batch_size=16, shuffle=False,
validation_data=(X_test_lstm, y_test_lstm), verbose=1,
callbacks = [EarlyStopping(monitor='val_loss', mode='min', patience=20, verbose=1)])
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 1, 100) 40800
dropout (Dropout) (None, 1, 100) 0
lstm_1 (LSTM) (None, 50) 30200
dropout_1 (Dropout) (None, 50) 0
dense (Dense) (None, 1) 51
=================================================================
Total params: 71,051
Trainable params: 71,051
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
31/31 [==============================] - 5s 28ms/step - loss: 0.0103 - val_loss: 0.0101
Epoch 2/50
31/31 [==============================] - 0s 8ms/step - loss: 0.0054 - val_loss: 0.0056
Epoch 3/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0065 - val_loss: 0.0059
Epoch 4/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0062 - val_loss: 0.0058
Epoch 5/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0061 - val_loss: 0.0055
Epoch 6/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0059 - val_loss: 0.0053
Epoch 7/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0058 - val_loss: 0.0050
Epoch 8/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0058 - val_loss: 0.0047
Epoch 9/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0055 - val_loss: 0.0044
Epoch 10/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0054 - val_loss: 0.0041
Epoch 11/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0051 - val_loss: 0.0038
Epoch 12/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0051 - val_loss: 0.0037
Epoch 13/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0047 - val_loss: 0.0033
Epoch 14/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0047 - val_loss: 0.0031
Epoch 15/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0045 - val_loss: 0.0029
Epoch 16/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0045 - val_loss: 0.0030
Epoch 17/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0043 - val_loss: 0.0028
Epoch 18/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0043 - val_loss: 0.0028
Epoch 19/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0042 - val_loss: 0.0027
Epoch 20/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0041 - val_loss: 0.0026
Epoch 21/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0039 - val_loss: 0.0026
Epoch 22/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0044 - val_loss: 0.0028
Epoch 23/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0037 - val_loss: 0.0025
Epoch 24/50
31/31 [==============================] - 0s 8ms/step - loss: 0.0037 - val_loss: 0.0025
Epoch 25/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0041 - val_loss: 0.0026
Epoch 26/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0037 - val_loss: 0.0025
Epoch 27/50
31/31 [==============================] - 0s 7ms/step - loss: 0.0037 - val_loss: 0.0024
Epoch 28/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0040 - val_loss: 0.0024
Epoch 29/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0037 - val_loss: 0.0024
Epoch 30/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0035 - val_loss: 0.0024
Epoch 31/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0035 - val_loss: 0.0024
Epoch 32/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0038 - val_loss: 0.0024
Epoch 33/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0024
Epoch 34/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 35/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0035 - val_loss: 0.0024
Epoch 36/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 37/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 38/50
31/31 [==============================] - 0s 5ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 39/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0032 - val_loss: 0.0023
Epoch 40/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0032 - val_loss: 0.0023
Epoch 41/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0030 - val_loss: 0.0023
Epoch 42/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 43/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0032 - val_loss: 0.0023
Epoch 44/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0030 - val_loss: 0.0023
Epoch 45/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0031 - val_loss: 0.0023
Epoch 46/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
Epoch 47/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0032 - val_loss: 0.0023
Epoch 48/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0029 - val_loss: 0.0024
Epoch 49/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0034 - val_loss: 0.0023
Epoch 50/50
31/31 [==============================] - 0s 6ms/step - loss: 0.0033 - val_loss: 0.0023
#calculating predictions
prediction_lstm = model.predict(X_test_lstm)
4/4 [==============================] - 0s 3ms/step
#inverse-transforming the predictions back to the original scale
prediction_inverse = sc.inverse_transform(prediction_lstm.reshape(-1, 1))
y_test_inverse = sc.inverse_transform(y_test_lstm.reshape(-1, 1))
prediction2_inverse = np.array(prediction_inverse[:,0][1:])
y_test2_inverse = np.array(y_test_inverse[:,0])
#Evaluating the LSTM model.
y_test2_inverse_without_last = y_test2_inverse[:-1]
rmse_lstm = rmse_metrics(y_test2_inverse_without_last, prediction2_inverse)
mape_lstm = mape_metrics(y_test2_inverse_without_last, prediction2_inverse)
print(f"Root Mean Squared Error | RMSE: {rmse_lstm}")
print(f"Mean Absolute Percentage Error | MAPE: {mape_lstm}")
Root Mean Squared Error | RMSE: 3638.94
Mean Absolute Percentage Error | MAPE: 9.94
#plotting performance of the model
epochs = range(1, 51)
plt.figure(figsize=(15, 5))
plt.plot(epochs, history.history["loss"], label="training", marker="o")
plt.plot(epochs, history.history["val_loss"], label="validation", marker="o")
plt.xlabel("Epochs")
plt.xticks(epochs)
plt.ylabel("Loss")
plt.legend()
plt.show()
Looking at the training metrics, we see that the validation and training loss decrease in parallel, so we are not overfitting.
test_df[: -2].index
y_test2_inverse_without_last=pd.DataFrame(y_test2_inverse_without_last, columns=['total_amount'], index= test_df[: -2].index)
plot_test_predictions(y_test2_inverse_without_last['total_amount'], prediction2_inverse )
Note: We have applied only one-step time series forecasting, which predicts the observation at the next time step.
Although we are getting a very good result, I am still reluctant to proceed with this model because we have limited sales data. We need to test it on more data to see how it performs.
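If we did want forecasts further than one day ahead from a one-step model, each prediction would have to be fed back in as the next input, which is where errors compound. A minimal sketch of that recursive loop, with a toy stand-in for `model.predict`:

```python
# Recursive multi-step forecasting with a one-step model: each prediction
# becomes the next input. `one_step_model` is a hypothetical stand-in for
# predicting one step ahead from a single (scaled) observation.
def recursive_forecast(one_step_model, last_value, horizon):
    forecasts = []
    current = last_value
    for _ in range(horizon):
        current = one_step_model(current)  # predict one step ahead
        forecasts.append(current)          # feed the prediction back in
    return forecasts

# toy model: tomorrow = 1.1 * today
fc = recursive_forecast(lambda x: 1.1 * x, 100.0, 3)
print([round(v, 2) for v in fc])  # → [110.0, 121.0, 133.1]
```

Every step builds on a predicted (not observed) value, so small one-step errors grow with the horizon, another reason to validate this model on more data before relying on it.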
In an attempt to increase the number of data points, I tried resampling the dataset at an hourly level and discovered many zero values at certain times of day, because no orders were placed during those hours. I tried applying a SARIMA model, but it produced negative predictions and a decreasing trend. Upon further reading and consulting, I found that such data requires transformations or different approaches altogether. Therefore, I limited myself to the daily data.
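One concrete way the hourly zeros bite: MAPE divides by the actual value, so any zero-sales hour makes the metric undefined, and near-zero hours blow it up. A minimal sketch with toy numbers (the zero-filtering here is an illustration of the problem, not a recommended fix):

```python
# MAPE divides by the actual value, so zero-sales hours must be excluded
# (or the data transformed) before the metric can even be computed.
def safe_mape(actuals, predictions):
    terms = [
        abs(a - p) / a
        for a, p in zip(actuals, predictions)
        if a != 0  # skip hours with no orders to avoid division by zero
    ]
    return 100 * sum(terms) / len(terms)

hourly_actuals = [0, 0, 5, 120, 0, 340]  # half the hours have no orders
hourly_preds   = [2, 1, 4, 100, 3, 300]
print(round(safe_mape(hourly_actuals, hourly_preds), 2))  # → 16.14
```

Silently dropping half the observations makes the metric unrepresentative, which is part of why intermittent-demand data calls for different methods entirely.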
Here is the summary of all the models we tested:
| Model | RMSE | MAPE |
|---|---|---|
| SARIMA(1,1,1)(0,1,1)(7) | 13810.59 | 68.99 |
| SARIMAX(1,1,2)(0,1,1)(7) Including impact of holidays | 13312.72 | 65.66 |
| Baseline Prophet | 27437.77 | 71.78 |
| Baseline Prophet with holiday | 15393.64 | 77.88 |
| Tuned Prophet with holiday | 12142.69 | 51.45 |
| XGBoost Regression including Holiday | 11493.59 | 52.18 |
| Tuned XGBoost Regression including Holiday | 12349.61 | 47.58 |
| LSTM (one-step prediction) | 3638.94 | 9.94 |
#plotting all the models together
fig = go.Figure()
fig.add_trace(go.Scatter(x=train_df.index, y=train_df['total_amount'].values, mode='lines', name="Train"))
fig.add_trace(go.Scatter(x=test_df.index, y=test_df['total_amount'].values, mode='lines', name="Test"))
fig.add_trace(go.Scatter(x=test_df.index, y=sarimax_predictions.values, mode='lines', name="SARIMAX"))
fig.add_trace(go.Scatter(x=test_df.index, y=forecast_tuned[-121:]['yhat'].values, mode='lines', name="FB Prophet tuned"))
fig.add_trace(go.Scatter(x=test_df.index, y=predictions_xgb_tuned, mode='lines', name="XGB Tuned"))
fig.add_trace(go.Scatter(x=test_df.index[: -2], y=prediction2_inverse, mode='lines', name="LSTM"))
fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(
yaxis_title="Revenue amount",
xaxis_title="Date",
title="Daily Sales amount and forecast using different models"
)
# fig.write_html("result.html")
fig.show()
#plotting all the models together
fig = go.Figure()
# fig.add_trace(go.Scatter(x=train_df.index, y=train_df['total_amount'].values, mode='lines', name="Train"))
fig.add_trace(go.Scatter(x=test_df.index, y=test_df['total_amount'].values, mode='lines', name="Test"))
fig.add_trace(go.Scatter(x=test_df.index, y=sarimax_predictions.values, mode='lines', name="SARIMAX"))
fig.add_trace(go.Scatter(x=test_df.index, y=forecast_tuned[-121:]['yhat'].values, mode='lines', name="FB Prophet tuned"))
fig.add_trace(go.Scatter(x=test_df.index, y=predictions_xgb_tuned, mode='lines', name="XGB Tuned"))
fig.add_trace(go.Scatter(x=test_df.index[: -2], y=prediction2_inverse, mode='lines', name="LSTM"))
fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(
yaxis_title="Revenue amount",
xaxis_title="Date",
title="Daily Sales amount and forecast using different models"
)
fig.write_html("result.html")
fig.show()
We choose FB Prophet as the best performing model because it achieves a reasonable MAPE while capturing the trend, the seasonality, and the peaks within the week and month.
The tuned XGBoost also gave a good numeric result, but the plots show that it did not capture the weekly peaks or the trend. For the LSTM, we would need further research and testing on more data points before we could proceed with it.
#saving the model as pickle file
import joblib
joblib.dump(fb_tuned, "forecasting_model.pkl")
['forecasting_model.pkl']